The aim of this notebook is to introduce the many other resources available online for learning about Python, working with data and much, much more! We'll also cover how you can pull any of this code into this environment and run it here, and how to deal with annoying dependencies.


Learning Programming

This course isn't designed to introduce you to programming in a serious way, but Jupyter notebooks are a great environment to learn programming in, and at least 51 different languages are supported. Which language you should learn depends on what you plan to do. For normal scientific programming, and for beginners, Python is a good choice, thanks to its massive number of supported packages, readability and active community. However, R is very popular among statisticians, Fortran among physicists, and Haskell is used at Edinburgh to teach programming. A good comparison can be found [here][meta]. We hope to have kernels available for these in this notebook environment soon (~August 2015).

There are many free resources for learning different programming languages online, some of which are [notebooks][books]. A classic textbook for learning programming in general is the Art and Craft of Programming, also available online for free. There are also many courses available either as free courseware from other universities or from Coursera (or similar): these are summarised [on this page under "Where to Learn"][meta].

Getting a notebook

So, as an example, say you've decided you want to follow this notebook on Learning Python. You could download it to your local computer, then navigate to the tree view and upload it from there, but that's a lot like hard work. Instead, you can just execute the following cell to pull that notebook into your home directory:

[meta]: https://www.metacademy.org/roadmaps/rgrosse/basic_programming
[books]: https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks#programming-and-computer-science


In [20]:
%%bash
curl http://nbviewer.ipython.org/urls/bitbucket.org/amjoconn/watpy-learning-to-code-with-python/raw/3441274a54c7ff6ff3e37285aafcbbd8cb4774f0/notebook/Learn%20to%20Code%20with%20Python.ipynb > Learn-to-Code-with-Python.ipynb


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 75752  100 75752    0     0  97210      0 --:--:-- --:--:-- --:--:-- 97117

Taking apart this command:

  • First, the %%bash cell magic says to execute the following cell in a bash shell, independent of the notebook. So anything in the cell is going to be executed like a shell script on the command line.
  • Next, we're using the curl command, which takes a web address and prints the page source onto standard output (i.e. it would just print the page source to the terminal, if it weren't for the rest of the command).
  • Since we want the page in a file, we use the > operator to redirect the output of the curl command into a file.
  • And the file name we've chosen is Learn-to-Code-with-Python.ipynb.

So now, if you find a notebook file hosted anywhere on the internet you can download it with curl using the address and put it in whatever file you like.
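If you would rather stay in Python than drop into bash, the same download can be sketched with the standard library. This is only a minimal sketch: the function name download_notebook and the example address in the comment are made up for illustration, and there is no handling of network errors.

```python
import shutil
import urllib.request


def download_notebook(url, filename):
    """Fetch `url` and write the response body to `filename`, like `curl URL > FILE`."""
    with urllib.request.urlopen(url) as response, open(filename, "wb") as out:
        shutil.copyfileobj(response, out)  # stream the body straight into the file
    return filename


# Hypothetical usage; substitute the address of any notebook file:
# download_notebook("https://example.org/some-notebook.ipynb", "some-notebook.ipynb")
```

As with curl, whatever the server sends back is written to the file unchanged, so it is worth opening the result to check that it really is a notebook.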

Getting a repository

However, if you want to pull in an entire repository, the command is different, because we're going to use the Git version control system. So, for example, say you want the IPython minibook code; then you need to clone the repository. To do that you'll need the clone URL, which is highlighted in the following image:

We would like to clone with HTTPS in this case, so the corresponding URL is https://github.com/ipython-books/minibook-code.git. Now, we just need to use the following git command:


In [23]:
%%bash
git clone https://github.com/ipython-books/minibook-code.git


Cloning into 'minibook-code'...

This has created a directory containing everything in that repository, including the entire revision history:


In [24]:
cd minibook-code/


/home/gngdb/minibook-code

In [25]:
ls


chapter2/  chapter3/  chapter4/  chapter5/  chapter6/  README.md

The following is an example of the git log (extremely short as the code in this repository has just been migrated from another project):


In [36]:
%%bash
git log


commit de93d0a33a8918c6ac490428048e1a8dfcc81e87
Author: Cyrille Rossant <cyrille.rossant@gmail.com>
Date:   Sun Sep 28 14:07:40 2014 +0200

    Added code.

commit 55bb3607ccf36536a7b8bfe99e54806932872a3e
Author: Cyrille Rossant <rossant@users.noreply.github.com>
Date:   Tue Oct 29 06:03:07 2013 -0700

    Initial commit

In [37]:
cd ..


/home/gngdb
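The clone step can also be driven from a normal Python cell, which is handy if you want to loop over several repositories. The following is a rough sketch, assuming Python 3.5 or later (for subprocess.run) and that git is installed; the helper names clone_command and clone are invented here and are not part of any library.

```python
import subprocess


def clone_command(url, directory=None):
    """Build the argument list for `git clone URL [DIRECTORY]`."""
    command = ["git", "clone", url]
    if directory is not None:
        command.append(directory)  # optional name for the local directory
    return command


def clone(url, directory=None):
    """Run git clone; raises CalledProcessError if git exits with an error."""
    subprocess.run(clone_command(url, directory), check=True)


# Usage, equivalent to the bash cell above (needs network access):
# clone("https://github.com/ipython-books/minibook-code.git")
```

Passing the command as a list rather than a single string means the URL and directory name are handed to git as they are, without going through a shell.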

Learning Python

The following cell will pull in some repositories that are useful for learning Python and put them in a subdirectory learning-python:


In [42]:
%%bash
mkdir learning-python
cd learning-python
git clone https://github.com/ehmatthes/intro_programming.git
git clone https://github.com/yoavram/CS1001.py.git


Cloning into 'intro_programming'...
Cloning into 'CS1001.py'...

Even more resources

Here are some other things that may be useful:

  • Miscellaneous Python resources:
    • pycrumbs: a wiki page populated with a massive number of useful links (e.g. if you wanted to program something for Google Glass, it's there).
    • pythonidae: an index of tools for scientific programming in Python for practically any task.
    • awesome-python: an index of Python tools for everything.

Dealing with Data in Python

Python is a popular language for science and data science, and many of the most popular notebooks do data science in Python. As a result, there are some very good introductions to data science available as IPython notebooks.

The following cell will populate a directory python-data-science with some of the best resources for learning data science techniques in Python:


In [45]:
%%bash
mkdir python-data-science
cd python-data-science
git clone https://github.com/nborwankar/LearnDataScience.git Learn-Data-Science
curl https://raw.githubusercontent.com/mwaskom/Psych216/master/week6_tutorial.ipynb > A-Tutorial-on-Model-Reliability.ipynb
mkdir Holoviews-tutorials
git clone https://github.com/ioam/holoviews.git
mv holoviews/doc/Tutorials/* Holoviews-tutorials/
rm -rf holoviews
git clone https://bitbucket.org/hrojas/learn-pandas.git An-Introduction-To-Pandas
git clone https://github.com/amueller/tutorial_ml_gkbionics.git Simple-Machine-Learning-with-Scikit-Learn
git clone https://github.com/mwaskom/Psych216.git Statistics-and-Data-Analysis-in-Python
git clone https://github.com/ogrisel/parallel_ml_tutorial.git Parallel-Machine-Learning-with-Scikit-Learn
git clone https://github.com/ResearchComputing/Meetup-Fall-2013.git
mkdir Python-for-Data-Analysis
mv Meetup-Fall-2013/python/* Python-for-Data-Analysis/
rm -rf Meetup-Fall-2013


Cloning into 'Learn-Data-Science'...
Checking out files: 100% (170/170), done.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  228k  100  228k    0     0   305k      0 --:--:-- --:--:-- --:--:--  304k
Cloning into 'holoviews'...
Cloning into 'An-Introduction-To-Pandas'...
Cloning into 'Simple-Machine-Learning-with-Scikit-Learn'...
Cloning into 'Statistics-and-Data-Analysis-in-Python'...
Cloning into 'Parallel-Machine-Learning-with-Scikit-Learn'...
Cloning into 'Meetup-Fall-2013'...
Cloning into 'Meetup-Fall-2013'...
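Twice in the cell above we clone a repository, keep one subdirectory and throw the rest away (mkdir, mv, rm -rf). That pattern can be sketched in Python as follows; keep_subdirectory is a made-up helper name, and the sketch assumes the repository has already been cloned into repo_dir.

```python
import os
import shutil


def keep_subdirectory(repo_dir, subdir, dest):
    """Move the contents of repo_dir/subdir into dest, then delete repo_dir."""
    source = os.path.join(repo_dir, subdir)
    os.makedirs(dest, exist_ok=True)  # like `mkdir`, but fine if dest already exists
    for name in os.listdir(source):   # unlike the shell's `*`, this includes hidden files
        shutil.move(os.path.join(source, name), os.path.join(dest, name))
    shutil.rmtree(repo_dir)           # like `rm -rf repo_dir`


# Usage matching the last three lines of the bash cell above:
# keep_subdirectory("Meetup-Fall-2013", "python", "Python-for-Data-Analysis")
```

Be careful with the last step: like rm -rf, shutil.rmtree deletes the whole directory without asking.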

Recommendations

The following are recommendations by me, Gavin Gray, and not necessarily the opinion of the course, Edinburgh University or anyone else. I just wanted to provide some recommendations of useful resources and tools if you would like to approach some more advanced concepts in data science.

First, I have to recommend Probabilistic Programming and Bayesian Methods for Hackers. If you're not familiar with probabilistic programming, then the title will make no sense to you. Probabilistic programming is a way of separating the model-building part of data science from what is called the inference part, which is usually the algorithm that actually makes the predictions. Usually, you'd like your model to be a little more complicated, because the real world is a complicated place, but writing by hand the algorithm needed to make predictions with that model would take a long time. With probabilistic programming you describe the model and a general-purpose inference algorithm does the rest. There are some drawbacks to this method, such as problems with scaling to large datasets, but these are discussed well in the book itself, along with what probabilistic programming is and how it works. Practically speaking, if it is applied well, it is a very good way of doing Bayesian statistics, and I would say it's a very good way to attack the kinds of problems people would normally use statistics for. Finally, note that if your model is correct and you can do inference, then you will make the best predictions possible (given the data); for an example of this, see Iain Murray's Dark Worlds blog post.
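To make the split between model and inference a bit more concrete, here is a toy sketch in plain Python. It is not how the book does it (the book uses PyMC and sampling-based inference); it uses a brute-force grid over one parameter, a flat prior, and made-up coin-flip data, and the names grid_posterior and coin_log_likelihood are invented for this example.

```python
import math


def grid_posterior(log_likelihood, grid):
    """Generic inference: normalised posterior weights over `grid`, assuming a flat prior.

    Nothing in here knows what the model is; it only evaluates the function it is given.
    """
    log_weights = [log_likelihood(theta) for theta in grid]
    peak = max(log_weights)  # subtract the peak before exponentiating, for numerical stability
    weights = [math.exp(lw - peak) for lw in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


# The model: a coin with unknown bias p. The data are made up for illustration.
heads, flips = 7, 10


def coin_log_likelihood(p):
    """Log-probability (up to a constant) of the observed flips given bias p."""
    return heads * math.log(p) + (flips - heads) * math.log(1 - p)


grid = [i / 100 for i in range(1, 100)]  # candidate values 0.01 ... 0.99
posterior = grid_posterior(coin_log_likelihood, grid)
most_probable = grid[posterior.index(max(posterior))]
print(most_probable)
```

To try a different model you would only swap out coin_log_likelihood; grid_posterior stays the same. Real probabilistic programming tools make the same split, but with inference algorithms that cope with many parameters at once.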

If you encounter terms that you don't understand, or a concept you've heard of but haven't quite got the hang of yet, then probably the best place to go is Metacademy. The idea is that it is a "package manager for knowledge"; a package manager being something that keeps track of packages you've installed (typically on Linux) and makes sure that you've already installed everything that a given package depends on to work. Applied to knowledge, that means you get a list of dependencies you can tick off before you reach the thing you want to understand, rather than reading the same page in a textbook over and over wondering why you don't understand a step in the reasoning. Also, they only link to other resources, rather than trying to reinvent the wheel. So, you get the best resource for a given topic, which is often a PDF of open access lecture notes that just happens to be written very well, or a pre-print textbook that you can download for free. And they give multiple resources, so you can try the second if you don't like the style of the first.

I fully endorse this list of notebooks on various topics. In particular, the parallel machine learning tutorial (which you already have if you ran the cell above for python-data-science) was very useful to me when I was working on a biological data science project where we had to speed up processing by parallelising. In addition, since Olivier Grisel is a scikit-learn developer, he shows how to use all of the scikit-learn tools quickly and efficiently, and highlights some potential pitfalls. Also, the notebook on d3.js may come in useful if you would one day like to make something like this.

If you would like to look at the code from research papers, one of the best resources available (and getting better rapidly) is GitXiv. The idea is to match papers with code that will replicate the results of the paper, thus leading to reproducible science. One nice thing about the way they are doing it is the first person to reproduce the results of the paper can simply link their code to the paper, so even if the original authors don't release their code, the code can still make it into the open. Many of these have IPython notebooks.

Finally (and this is very biased), there are some great notebooks on deep learning available, making it relatively easy to do some difficult things. For instance, if you saw the deep dream blog post by Google recently, did you know that you can grab a notebook with all of the code to make those images? More recently, there has been work on redrawing images in the style of other images (usually famous paintings), and there are notebooks on how to do this. Playing with these is fun, but you will have problems running them on our server, or on your own machine. To run these, you really need a computer with a powerful GPU and the appropriate drivers set up. One easy way to do this would be to use Amazon's EC2, which would involve setting up the remote server with an IPython notebook server and using an ssh tunnel to access it from your local machine. That's beyond our scope, but there are tutorials online that will cover it.


In [55]:
%%bash
mkdir recommendations
cd recommendations
git clone https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers.git
git clone https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Python.git
git clone https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition.git
git clone https://github.com/jseabold/538model.git
curl https://dato.com/learn/gallery/notebooks/graph_analytics_movies.ipynb > Seven-Degrees-of-Kevin-Bacon.ipynb
git clone https://github.com/ellisonbg/talk-2014-strata-sc.git
curl http://norvig.com/ipython/xkcd1313.ipynb > Norvig-regex-golf.ipynb


Cloning into 'Probabilistic-Programming-and-Bayesian-Methods-for-Hackers'...
Cloning into 'Kalman-and-Bayesian-Filters-in-Python'...
Cloning into 'Mining-the-Social-Web-2nd-Edition'...
Cloning into '538model'...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  663k  100  663k    0     0   322k      0  0:00:02  0:00:02 --:--:--  322k
Cloning into 'talk-2014-strata-sc'...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 39075  100 39075    0     0  47695      0 --:--:-- --:--:-- --:--:-- 47652

Dependencies

So you've looked through the projects that are available and cloned one you think looks interesting and you want to start coding. But, you start up the notebook and this happens:


In [1]:
import pymc


---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-5f262cfcb99b> in <module>()
----> 1 import pymc

ImportError: No module named 'pymc'

So what's happened is we've tried to import the package PyMC, but we haven't installed it yet, so we get an ImportError. Normally, if you were on your own computer, you could just run the following:


In [3]:
!pip3 install pymc


Collecting pymc
  Using cached pymc-2.3.4.tar.gz
Building wheels for collected packages: pymc
  Running setup.py bdist_wheel for pymc
  Stored in directory: /home/gngdb/.cache/pip/wheels/2a/2e/f2/bc40a944cdf707740a33c85e81dda55b4667f783d4850f84ea
Successfully built pymc
Installing collected packages: pymc
Exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/pip/basecommand.py", line 211, in main
    status = self.run(options, args)
  File "/usr/local/lib/python3.4/dist-packages/pip/commands/install.py", line 311, in run
    root=options.root_path,
  File "/usr/local/lib/python3.4/dist-packages/pip/req/req_set.py", line 646, in install
    **kwargs
  File "/usr/local/lib/python3.4/dist-packages/pip/req/req_install.py", line 803, in install
    self.move_wheel_files(self.source_dir, root=root)
  File "/usr/local/lib/python3.4/dist-packages/pip/req/req_install.py", line 998, in move_wheel_files
    isolated=self.isolated,
  File "/usr/local/lib/python3.4/dist-packages/pip/wheel.py", line 339, in move_wheel_files
    clobber(source, lib_dir, True)
  File "/usr/local/lib/python3.4/dist-packages/pip/wheel.py", line 310, in clobber
    ensure_dir(destdir)
  File "/usr/local/lib/python3.4/dist-packages/pip/utils/__init__.py", line 71, in ensure_dir
    os.makedirs(path)
  File "/usr/lib/python3.4/os.py", line 237, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.4/dist-packages/pymc'

As you can see, this fails because we don't have sufficient privileges. On your own computer you could use sudo. We can get around this by installing to our user account with the --user flag:


In [4]:
!pip3 install --user pymc


Collecting pymc
Installing collected packages: pymc
Successfully installed pymc

Unfortunately, importing will still fail:


In [5]:
import pymc


---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-5-5f262cfcb99b> in <module>()
----> 1 import pymc

ImportError: No module named 'pymc'

When Python tries to import a package, it looks in a few places, and it turns out that when you install something with the --user flag it is installed to your user account at ~/.local/:


In [11]:
!ls ~/.local/lib/python3.4/site-packages


pymc  pymc-2.3.4.dist-info

And Python is only looking in:


In [12]:
import sys
sys.path


Out[12]:
['',
 '/usr/local/dds-notebooks',
 '/usr/lib/python3.4',
 '/usr/lib/python3.4/plat-x86_64-linux-gnu',
 '/usr/lib/python3.4/lib-dynload',
 '/usr/local/lib/python3.4/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/usr/local/lib/python3.4/dist-packages/IPython/extensions']

However, if we add the user site-packages directory from ~/.local/ to Python's path, we will be able to import the package:


In [18]:
sys.path.append("/home/gngdb/.local/lib/python3.4/site-packages/")

In [19]:
import pymc

However, this only persists for the current notebook session, and only after we've run the sys.path.append command. In a new notebook, we'll have to run sys.path.append again. On the plus side, it picks up every package we've installed using pip install --user, so it is a fairly useful way to install extra packages. For some packages, the same thing can be done by just cloning the git repository and adding its directory to your path using sys.path.append.
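Rather than typing out the full path by hand, you can ask Python where your user packages live. The following is a small sketch using the standard library's site and importlib modules; the helper names add_user_site_to_path and is_importable are our own, and it assumes a standard Python installation where site.getusersitepackages is available.

```python
import importlib.util
import site
import sys


def add_user_site_to_path():
    """Append the per-user site-packages directory to sys.path if it is missing."""
    user_site = site.getusersitepackages()  # e.g. somewhere under ~/.local/ on Linux
    if user_site not in sys.path:
        sys.path.append(user_site)
    return user_site


def is_importable(module_name):
    """Return True if Python can currently locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


user_site = add_user_site_to_path()
print(user_site)
print(is_importable("pymc"))  # True only if pymc is installed somewhere on sys.path
```

You could put the add_user_site_to_path call in the first cell of each new notebook, since the change to sys.path is lost when the session ends.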